Group Project Team 4 - MGMT655

1. Business Question

Problem Statement

  • This analysis predicts whether a patient is at high risk of heart disease based on the available variables.

  • This prediction is valuable to the insurance industry, which wants to assess an applicant's condition before signing an insurance contract, and to the health care industry, which wants to detect heart disease risk as early as possible.

We start by activating all of our packages.

# Load necessary packages
pacman::p_load(tidyverse, lubridate, tidymodels, skimr, GGally, ggstatsplot,
               usemodels, janitor, doParallel, ggthemes, ggthemr, plotly, vip,
               shiny, shinydashboard, DT, caret, jtools, interactions, huxtable,
               Hmisc, broom, glue)

2. Import

2.1. Variables in the Dataset

Variable (Feature) Name Description
age Age of the person
sex Sex of the person (1, 0)
cp Chest pain type (0, 1, 2, 3)
trtbps Resting blood pressure (in mmHg)
chol Serum cholesterol (in mg/dl)
fbs Fasting blood sugar > 120 mg/dl (1 = True, 0 = False)
restecg Resting electrocardiography results (0, 1, 2)
thalachh Maximum heart rate achieved
exng Exercise-induced angina (1 = Yes, 0 = No)
oldpeak Previous peak (ST depression)
slp Slope (0, 1, 2)
caa Number of major vessels (0–4)
thall Thal rate (0–3)
output The risk of heart attack (1 = Yes, 0 = No)

2.2. Data Source

The data was originally sourced from Kaggle.

Importing the data

heart <- read_csv("heart.csv") 

3. Transform & EDA

3.1. Exploratory Data Analysis

We start by exploring our data using glimpse() and skim().

glimpse(heart)
## Rows: 303
## Columns: 14
## $ age      <dbl> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5~
## $ sex      <dbl> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1~
## $ cp       <dbl> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3, 0~
## $ trtbps   <dbl> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1~
## $ chol     <dbl> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2~
## $ fbs      <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
## $ restecg  <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1~
## $ thalachh <dbl> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1~
## $ exng     <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0~
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0~
## $ slp      <dbl> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2, 1~
## $ caa      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0~
## $ thall    <dbl> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3~
## $ output   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
skim(heart)
Data summary
Name heart
Number of rows 303
Number of columns 14
_______________________
Column type frequency:
numeric 14
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 54.37 9.08 29 47.5 55.0 61.0 77.0 ▁▆▇▇▁
sex 0 1 0.68 0.47 0 0.0 1.0 1.0 1.0 ▃▁▁▁▇
cp 0 1 0.97 1.03 0 0.0 1.0 2.0 3.0 ▇▃▁▅▁
trtbps 0 1 131.62 17.54 94 120.0 130.0 140.0 200.0 ▃▇▅▁▁
chol 0 1 246.26 51.83 126 211.0 240.0 274.5 564.0 ▃▇▂▁▁
fbs 0 1 0.15 0.36 0 0.0 0.0 0.0 1.0 ▇▁▁▁▂
restecg 0 1 0.53 0.53 0 0.0 1.0 1.0 2.0 ▇▁▇▁▁
thalachh 0 1 149.65 22.91 71 133.5 153.0 166.0 202.0 ▁▂▅▇▂
exng 0 1 0.33 0.47 0 0.0 0.0 1.0 1.0 ▇▁▁▁▃
oldpeak 0 1 1.04 1.16 0 0.0 0.8 1.6 6.2 ▇▂▁▁▁
slp 0 1 1.40 0.62 0 1.0 1.0 2.0 2.0 ▁▁▇▁▇
caa 0 1 0.73 1.02 0 0.0 0.0 1.0 4.0 ▇▃▂▁▁
thall 0 1 2.31 0.61 0 2.0 2.0 3.0 3.0 ▁▁▁▇▆
output 0 1 0.54 0.50 0 0.0 1.0 1.0 1.0 ▇▁▁▁▇
heart %>% 
  count(output)
## # A tibble: 2 x 2
##   output     n
##    <dbl> <int>
## 1      0   138
## 2      1   165

Observation:

  • There are 14 variables in total. We want to predict the column “output”.
  • There are no missing values.
  • All columns are stored as numeric (dbl), but several are actually categorical, so we will need to convert them to factors later.
  • There are 138 negative and 165 positive outputs. The classes are fairly balanced, so we do not need downsampling; simple stratified sampling should be fine.
  • The numeric predictors are on different scales, so they should be normalized.
  • caa and thall have very few observations in some categories, so in one of our recipes we will try collapsing those rare levels.

3.2. Transforming Data

heart <- heart %>% 
  mutate(thall_2 = replace(thall, thall < 1, 1),
         caa_2 = replace(caa, caa > 2, 2)) %>% 
  # Convert only the truly categorical columns to factors. Note that running
  # as.factor() on every column and then as.numeric() on a factor returns the
  # underlying level codes rather than the original values, so the numeric
  # predictors are left untouched here.
  mutate(across(-c(age, trtbps, chol, thalachh, oldpeak), as.factor),
         output = forcats::fct_relevel(output, "1"))
# The goals are to find out whether the variables are already set into the correct class
# and whether the mutate for thall and caa is well done.
skim(heart)
Data summary
Name heart
Number of rows 303
Number of columns 16
_______________________
Column type frequency:
factor 11
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
sex 0 1 FALSE 2 1: 207, 0: 96
cp 0 1 FALSE 4 0: 143, 2: 87, 1: 50, 3: 23
fbs 0 1 FALSE 2 0: 258, 1: 45
restecg 0 1 FALSE 3 1: 152, 0: 147, 2: 4
exng 0 1 FALSE 2 0: 204, 1: 99
slp 0 1 FALSE 3 2: 142, 1: 140, 0: 21
caa 0 1 FALSE 5 0: 175, 1: 65, 2: 38, 3: 20
thall 0 1 FALSE 4 2: 166, 3: 117, 1: 18, 0: 2
output 0 1 FALSE 2 1: 165, 0: 138
thall_2 0 1 FALSE 3 2: 166, 3: 117, 1: 20
caa_2 0 1 FALSE 3 0: 175, 1: 65, 2: 63

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 54.37 9.08 29 47.5 55.0 61.0 77.0 ▁▆▇▇▁
trtbps 0 1 131.62 17.54 94 120.0 130.0 140.0 200.0 ▃▇▅▁▁
chol 0 1 246.26 51.83 126 211.0 240.0 274.5 564.0 ▃▇▂▁▁
thalachh 0 1 149.65 22.91 71 133.5 153.0 166.0 202.0 ▁▂▅▇▂
oldpeak 0 1 1.04 1.16 0 0.0 0.8 1.6 6.2 ▇▂▁▁▁
glimpse(heart)
## Rows: 303
## Columns: 16
## $ age      <dbl> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5~
## $ sex      <fct> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1~
## $ cp       <fct> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3, 0~
## $ trtbps   <dbl> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1~
## $ chol     <dbl> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2~
## $ fbs      <fct> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
## $ restecg  <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1~
## $ thalachh <dbl> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1~
## $ exng     <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0~
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0~
## $ slp      <fct> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2, 1~
## $ caa      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0~
## $ thall    <fct> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3~
## $ output   <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ thall_2  <fct> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3~
## $ caa_2    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0~

3.3. Exploring Relationship Among Variables

We would like to see the relationships among numeric variables

heart %>% 
  ggcorrmat(colors = c("red",
                       "white",
                       "green"))

From the correlation matrix, there appears to be no strong correlation among the numeric variables. We can also inspect pairwise patterns among the numeric variables using ggpairs().

ggthemr("fresh")
heart %>% 
  select(output, where(is.numeric)) %>% 
  ggpairs()

From the ggpairs plot, we do not see any notable pairwise relationships, so we can proceed to the next step. Next, we examine the relationships between the categorical variables and output.

heart %>% 
  ggbarstats(x = sex, y = output)

heart %>% 
  ggbarstats(x = cp, y = output)

heart %>% 
  ggbarstats(x = fbs, y = output)

heart %>% 
  ggbarstats(x = restecg, y = output)

heart %>% 
  ggbarstats(x = exng, y = output)

heart %>% 
  ggbarstats(x = slp, y = output)

heart %>% 
  ggbarstats(x = caa, y = output)

heart %>% 
  ggbarstats(x = thall, y = output)

4. Predictive Modelling

After taking a glance at our data, we move on to predictive modelling.

4.1. Split

set.seed(210725)
heart_split <- heart %>% 
  initial_split(prop = .75,
                strata = "output")

class(heart_split)
## [1] "mc_split" "rsplit"
heart_training <- heart_split %>%  training()
heart_testing <- heart_split %>%  testing()

Splitting here is straightforward: we split the data into training and testing sets (75/25). We stratify on output so that both classes (0 and 1) appear in the same proportions in each set.
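As a quick sanity check (a sketch using the objects defined above), we can confirm that the class proportions of output are indeed similar in the two sets:

```r
# Compare the class balance of output across the training and testing sets
bind_rows(
  heart_training %>% count(output) %>% mutate(set = "training"),
  heart_testing  %>% count(output) %>% mutate(set = "testing")
) %>% 
  group_by(set) %>% 
  mutate(prop = n / sum(n))
```

Both sets should show roughly the same proportion of positive and negative outputs.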

4.2. Pre-Process

Feature engineering of the data.

recipe1 <- recipe(output~age+sex+cp+thalachh+exng,
                  data = heart_training) %>% 
  step_center(all_numeric_predictors()) %>% 
  step_scale(all_numeric_predictors()) 


recipe2 <- recipe(output~., data = heart_training) %>%
  step_rm(caa,thall, restecg, fbs) %>% 
  step_normalize(all_numeric_predictors())

Recipe 1 is a simple trial with a handful of predictors. Recipe 2 is based on our earlier analysis of the data: it drops caa, thall, restecg, and fbs (keeping the collapsed thall_2 and caa_2) and normalizes the numeric predictors.
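To verify what a recipe actually produces before fitting, we can prep() it and bake() the retained training data (a sketch using the recipe objects defined above):

```r
# Inspect the pre-processed training data for recipe 2: the removed columns
# should be gone and the numeric predictors should be centered and scaled
recipe2 %>% 
  prep() %>% 
  bake(new_data = NULL) %>% 
  glimpse()
```

This is a useful habit whenever a recipe removes or transforms columns, since mistakes surface here rather than inside the tuning loop.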

4.3. Fit

We use a random forest (trees = 1000) for both workflows, tuning mtry.

RF_model <- rand_forest() %>% 
  set_args(mtry = tune(),
           trees =1000) %>% 
  set_engine("ranger",
             importance = "impurity") %>%
  set_mode("classification")

4.4. Tuning

Workflow

RF_workflow_1<- workflows::workflow() %>% 
  add_recipe(recipe1) %>% 
  add_model(RF_model)

RF_workflow_2 <- workflows::workflow() %>% 
  add_recipe(recipe2) %>% 
  add_model(RF_model) 

Cross-Validation

set.seed(210725)
heartcv10 <- vfold_cv(heart_training, 10)

Parallel Processing 1

doParallel::registerDoParallel()

set.seed(210725)
heart_tuned_1 <- RF_workflow_1 %>% 
  tune_grid(resamples = heartcv10,
            grid = 10)

Parallel Processing 2

set.seed(210725)
heart_tuned_2 <- RF_workflow_2%>% 
  tune_grid(resamples = heartcv10, 
            grid = 10)

Collect Metrics

heart_tuned_results_1 <- heart_tuned_1 %>% 
  collect_metrics() 
heart_tuned_results_2 <- heart_tuned_2 %>%
  collect_metrics()

Selecting Best Metrics

parameters_heart_tuned_1 <- heart_tuned_1%>% 
  select_best(metric = "roc_auc")
parameters_heart_tuned_2 <- heart_tuned_2 %>%
  select_best(metric = "roc_auc")

Finalize Workflow

finalized_workflow_heart_1 <- RF_workflow_1 %>%
  finalize_workflow(parameters_heart_tuned_1)
finalized_workflow_heart_2 <- RF_workflow_2 %>%
  finalize_workflow(parameters_heart_tuned_2)

Last Fit

fit_heart_1 <- finalized_workflow_heart_1 %>% 
  last_fit(heart_split) 
fit_heart_2 <- finalized_workflow_heart_2 %>% 
  last_fit(heart_split) 

4.5. Assess

We assess the models with several metrics (roc_auc, accuracy, f_meas, precision, recall), visualize them with confusion matrices and ROC curves against a baseline (null) model, and finally examine each model's variable importance.

Collecting Metrics and Predictions

performance_heart_1 <- fit_heart_1 %>% collect_metrics()
predictions_heart_1 <- fit_heart_1 %>% collect_predictions()
performance_heart_2 <- fit_heart_2 %>% collect_metrics()
predictions_heart_2 <- fit_heart_2 %>% collect_predictions()
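The class-based metrics computed piecemeal later in this section can also be obtained in one pass with yardstick's metric_set() (a sketch using the prediction tibbles collected above):

```r
# Bundle several classification metrics into a single callable set
class_metrics <- metric_set(accuracy, f_meas, precision, recall)

# Compute all of them at once for Model 1's test-set predictions
predictions_heart_1 %>% 
  class_metrics(truth = output, estimate = .pred_class)
```

The same call on predictions_heart_2 gives the Model 2 figures for comparison.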

Confusion Matrix

predictions_heart_1 %>% 
  conf_mat(truth = output, 
           estimate = .pred_class) %>% 
  pluck(1) %>% 
  as_tibble() %>%
  mutate(cm_colors = ifelse(Truth == 1 & Prediction == 1, "True Positive",
                            ifelse(Truth == 1 & Prediction == 0, "False Negative",
                                   ifelse(Truth == 0 & Prediction == 1, "False Positive",
                                          "True Negative")))) %>% 
  ggplot(aes(x = Prediction, y = Truth)) + 
  geom_tile(aes(fill = cm_colors)) +
  scale_fill_manual(values = c("True Positive" = "green3",
                               "True Negative" = "green1",
                               "False Positive" = "tomato3",
                               "False Negative" = "tomato1")) + 
  geom_text(aes(label = n), color = "white", size = 10) + 
  geom_label(aes(label = cm_colors), vjust = 2) +
  labs(title = "Confusion Matrix Model 1") +
  ggthemes::theme_fivethirtyeight() + 
  theme(axis.title = element_text(),
        legend.position = "none")

predictions_heart_2 %>% 
  conf_mat(truth = output, 
           estimate = .pred_class) %>% 
  pluck(1) %>% 
  as_tibble() %>%
  mutate(cm_colors = ifelse(Truth == 1 & Prediction == 1, "True Positive",
                            ifelse(Truth == 1 & Prediction == 0, "False Negative",
                                   ifelse(Truth == 0 & Prediction == 1, "False Positive",
                                          "True Negative")))) %>% 
  ggplot(aes(x = Prediction, y = Truth)) + 
  geom_tile(aes(fill = cm_colors)) +
  scale_fill_manual(values = c("True Positive" = "green3",
                               "True Negative" = "green1",
                               "False Positive" = "tomato3",
                               "False Negative" = "tomato1")) + 
  geom_text(aes(label = n), color = "white", size = 10) + 
  geom_label(aes(label = cm_colors), vjust = 2) +
  labs(title = "Confusion Matrix Model 2") +
  ggthemes::theme_fivethirtyeight() + 
  theme(axis.title = element_text(),
        legend.position = "none")

ROC Curves

Creating Null Model

baseline_model <- null_model() %>%
  set_engine("parsnip") %>%
  set_mode("classification")

baseline_workflow_1 <- workflow() %>%
  add_recipe(recipe1) %>%
  add_model(baseline_model) %>%
  fit_resamples(heartcv10,
                control = control_resamples(save_pred = T)
  )

performance_BASELINE_1 <- baseline_workflow_1 %>% collect_metrics()
predictions_BASELINE_1 <- baseline_workflow_1 %>% collect_predictions()

Adding an algorithm column to each set of predictions, then combining them.

predictions_heart_1 <- predictions_heart_1 %>% 
  mutate(algorithm = "RF 1")

predictions_heart_2 <- predictions_heart_2 %>% 
  mutate(algorithm = "RF 2")

predictions_BASELINE_1  <- predictions_BASELINE_1 %>% 
  mutate(algorithm = "NULL Model")

comparing_predictions_1 <- bind_rows(predictions_heart_1, predictions_heart_2,
                                   predictions_BASELINE_1)

Creating the ROC curves to compare the performance of each model.

comparing_predictions_1 %>%
  group_by(algorithm) %>%
  roc_curve(truth = output, 
            .pred_1) %>%
  autoplot() +
  ggthemes::scale_color_fivethirtyeight() +
  labs(title = "Comparison of Predictive Power\nbetween Random Forest & NULL Model\nin Predicting Heart Attack on Patients",
       subtitle = "Random Forest\nPerforms Better in Prediction",
       x = "1 - Specificity (False Positive Rate)",
       y = "Sensitivity (Recall)",
       color = "Prediction Tools") +
  theme(legend.position = c(.65, .25))

Based on the ROC curves, Model 2 performs noticeably better than Model 1, and both outperform the null model.

Metrics

Computing the metrics and comparing both models.

accuracy_1 <- predictions_heart_1 %>%
  metrics(output, .pred_class) %>%
  select(-.estimator) %>%
  filter(.metric == "accuracy") %>% 
  rename(accuracy = .estimate)

roc_auc_1 <- performance_heart_1 %>% 
  filter(.metric == "roc_auc") %>% 
  select(.estimate) %>% 
  rename(roc_auc = .estimate)

Fmeas_1 <- predictions_heart_1 %>%
  f_meas(output, .pred_class) %>%
  select(-.estimator) %>% 
  rename(F_Measure = .estimate)


Result_1 <- tibble(accuracy_1[,2],
                   roc_auc_1[,1],
                   Fmeas_1[,2],
                   "precision" = yardstick::precision(predictions_heart_1, output, .pred_class) %>%
                     select(.estimate),
                   "recall" = yardstick::recall(predictions_heart_1, output, .pred_class) %>%
                     select(.estimate)
) %>%
  unnest(cols = everything())

accuracy_2 <- predictions_heart_2 %>%
  metrics(output, .pred_class) %>%
  select(-.estimator) %>%
  filter(.metric == "accuracy") %>% 
  rename(accuracy = .estimate)

roc_auc_2 <- performance_heart_2 %>% 
  filter(.metric == "roc_auc") %>% 
  select(.estimate) %>% 
  rename(roc_auc = .estimate)

Fmeas_2 <- predictions_heart_2 %>%
  f_meas(output, .pred_class) %>%
  select(-.estimator) %>% 
  rename(F_Measure = .estimate)


Result_2 <- tibble(accuracy_2[,2],
                   roc_auc_2[,1],
                   Fmeas_2[,2],
                   "precision" = yardstick::precision(predictions_heart_2, output, .pred_class) %>%
                     select(.estimate),
                   "recall" = yardstick::recall(predictions_heart_2, output, .pred_class) %>%
                     select(.estimate)
) %>%
  unnest(cols = everything())

Result_summary <- bind_rows(round(Result_1, digits = 2), round(Result_2, digits = 2)) %>% 
  datatable()

Result_summary

Model 2 is better than Model 1. This suggests that the additional predictors, including the collapsed caa_2 and thall_2, improve the performance of the model.

Variable Importance

finalized_model_1 <- finalized_workflow_heart_1 %>% fit(heart)

model_summary_1 <- extract_fit_parsnip(finalized_model_1)$fit
model_summary_1
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~1L,      x), num.trees = ~1000, importance = ~"impurity", num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  1000 
## Sample size:                      303 
## Number of independent variables:  5 
## Mtry:                             1 
## Target node size:                 10 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1568953
feature_importance_1 <- data.frame(importance = model_summary_1$variable.importance) %>% 
  rownames_to_column("feature") %>% 
  as_tibble() %>% 
  mutate(feature = as.factor(feature))

feature_importance_1 %>% 
  ggplot(aes(x = importance, y = reorder(feature, importance), fill = importance)) +
  geom_col(show.legend = F) +
  scale_fill_gradient(low = "deepskyblue1", high = "deepskyblue4") +
  scale_x_continuous(expand = c(0, 0)) +
  labs(
    y = NULL,
    title = "Feature (Variable) Importance Model 1",
    subtitle = "cp is the most important variable") + 
  ggthemes::theme_fivethirtyeight()
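Since the vip package is already loaded, an importance plot can also be produced directly from the fitted model rather than by building the data frame by hand (a sketch, assuming the finalized_model_1 object fitted above):

```r
# vip() reads the impurity-based importance scores straight from the
# underlying ranger fit and draws the bar chart for us
finalized_model_1 %>% 
  extract_fit_parsnip() %>% 
  vip::vip(num_features = 5)
```

The manual version above offers more styling control; vip() is the quicker cross-check.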

finalized_model_2 <- finalized_workflow_heart_2 %>% fit(heart)

model_summary_2 <- extract_fit_parsnip(finalized_model_2)$fit
model_summary_2
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~2L,      x), num.trees = ~1000, importance = ~"impurity", num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  1000 
## Sample size:                      303 
## Number of independent variables:  11 
## Mtry:                             2 
## Target node size:                 10 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1251124
feature_importance_2 <- data.frame(importance = model_summary_2$variable.importance) %>% 
  rownames_to_column("feature") %>% 
  as_tibble() %>% 
  mutate(feature = as.factor(feature))

feature_importance_2 %>% 
  ggplot(aes(x = importance, y = reorder(feature, importance), fill = importance)) +
  geom_col(show.legend = F) +
  scale_fill_gradient(low = "deepskyblue1", high = "deepskyblue4") +
  scale_x_continuous(expand = c(0, 0)) +
  labs(
    y = NULL,
    title = "Feature (Variable) Importance Model 2",
    subtitle = "cp is still the most important variable") + 
  ggthemes::theme_fivethirtyeight()

5. Executive Summary

The executive summary is divided into three parts: evidence, interpretation, and recommendations.

5.1. Evidence

Model 2 has higher predictive power than Model 1, as shown by the metrics, the ROC curves, and the confusion matrices. Both models beat the null model, as the ROC curves show. Model 2 achieves 84% accuracy, 95% roc_auc, 87% F-measure, 81% precision, and 93% recall. We found that cp, thall, caa, oldpeak, and thalachh are the most important variables for predicting heart attack risk.

5.2. Interpretation

Chest pain (cp) is the most direct and important variable for predicting heart attack risk: patients with chest pain types 1–3 have a higher risk of heart attack. Thal rate (thall) is the second most important variable; type 2 is associated with higher risk. The number of major vessels (caa) is the third most important variable; patients with caa = 0 have a higher risk of heart attack.

5.3. Recommendations

Using this prediction model, an insurance company can make more accurate predictions of heart attack risk for applicants. When processing an application, the company should require a medical check-up covering the applicant's chest pain type (cp), thal rate (thall), number of major vessels (caa), previous peak (oldpeak), and maximum heart rate achieved (thalachh). With this predictive model, the insurance company can adjust premium pricing for different customers and thus improve profitability.

Limitations

For Recipe 1 we simply chose variables without much consideration; Recipe 2 is based on a more thorough analysis of the data. The model could be improved further by examining interactions among variables.